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Abstract 

> ■ 

QQ ' Efficient representations and solutions for large decision problems with continuous and 

. discrete variables are among the most important challenges faced by the designers of auto- 

' mated decision support systems. In this paper, we describe a novel hybrid factored Markov 

decision process (MDP) model that allows for a compact representation of these problems, 
and a new hybrid approximate linear programming (HALP) framework that permits their 
efhcient solutions. The central idea of HALP is to approximate the optimal value function 
by a linear combination of basis functions and optimize its weights by linear programming. 
We analyze both theoretical and computational aspects of this approach, and demonstrate 
its scale-up potential on several hybrid optimization problems. 
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' 1. Introduction 



A dynamic decision problem with components of uncertainty can be very often formulated as 
a Markov decision process (MDP). An MDP represents a controlled stochastic process whose 
dynamics is described by state transitions. Objectives of the control are modeled by rewards 
(or costs), which are assigned to state-action configurations. In the simplest form, the states 
and actions of an MDP are discrete and unstructured. These models can be solved efficiently 
by standard dynamic programming methods (Bellman, 1957; Puterman, 1994; Bertsekas & 
Tsitsiklis, 1996). 

Unfortunately, textbook models rarely meet the practice and its needs. First, real-world 
decision problems are naturally described in a factored form and may involve a combination 
of discrete and continuous variables. Second, there are no guarantees that compact forms of 
the optimal value function or policy for these problems exist. Therefore, hybrid optimization 
problems are usually discretized and solved approximately by the methods for discrete-state 
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MDPs. The contribution of this work is a principled, sound, and efficient approach to solving 
large-scale factored MDPs that avoids this discretization step. 

Our framework is based on approximate linear programming (ALP) (Schweitzer & Seid- 
mann, 1985), which has been already applied to solve decision problems with discrete state 
and action variables efficiently (Schuurmans k. Patrascu, 2002; de Farias & Van Roy, 2003; 
Guestrin et al., 2003). These applications include context-specific planning (Guestrin et al., 
2002), multiagent planning (Guestrin et al., 2002), relational MDPs (Guestrin et al., 2003), 
and first-order MDPs (Sanner k. Boutilier, 2005). In this work, we show how to adapt ALP 
to solving large-scale factored MDPs in hybrid state and action spaces. 

The presented approach combines factored MDP representations (Sections 3 and 4) and 
optimization techniques for solving large-scale structured linear programs (Section 6). This 
leads to various benefits. First, the quality and complexity of value function approximations 
is controlled by using basis functions (Section 3.2). Therefore, we can prevent an exponential 
blowup in the complexity of computations when other techniques cannot. Second, we always 
guarantee that HALP returns a solution. Its quality naturally depends on the choice of basis 
functions. As analyzed in Section 5.1, if these are selected appropriately, we achieve a close 
approximation to the optimal value function V*. Third, a well-chosen class of basis functions 
yields closed-form solutions to the backprojections of our value functions (Section 5.2). This 
step is important for solving hybrid optimization problems more efficiently. Finally, solving 
hybrid factored MDPs reduces to building and satisfying relaxed formulations of the original 
problem (Section 6). The formulations can be solved efficiently by the cutting plane method, 
which has been studied extensively in applied mathematics and operations research. 

For better readability of the paper, our proofs are deferred to Appendix A. The following 
notation is adopted throughout the work. Sets and their members are represented by capital 
and small italic letters as S and s, respectively. Sets of variables, their subsets, and members 
of these sets are denoted by capital letters as X, Xj, and Xj. In general, corresponding small 
letters represent value assignments to these objects. The subscripted indices D and C denote 
the discrete and continuous variables in a variable set and its value assignment. The function 
Dom(-) computes the domain of a variable or the domain of a function. The function Par( ) 
returns the parent set of a variable in a graphical model (Howard & Matheson, 1984; Dean 
k, Kanazawa, 1989). 

2. Markov Decision Processes 

Markov decision processes (Bellman, 1957) provide an elegant mathematical framework for 
modeling and solving sequential decision problems in the presence of uncertainty. Formally, 
a finite-state Markov decision process (MDP) is given by a 4-tuple M = (5, A, P, R), where 
S = {si, ... , Sn} is a set of states, A = {ai, ... , am} is a set of actions, P : <Sx^x»S — >■ [0, 1] 
is a stochastic transition function of state dynamics conditioned on the preceding state and 
action, and i?:<Sx,4— >-Misa reward function assigning immediate payoffs to state-action 
configurations. Without loss of generality, the reward function is assumed to be nonnegative 
and bounded from above by a constant -Rmax (Puterman, 1994). Moreover, we assume that 
the transition and reward models are stationary and known a priori. 

Once a decision problem is formulated as an MDP, the goal is to find a policy n : S A 
that maximizes some objective function. In this paper, the quality of a policy vr is measured 
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by the infinite horizon discounted reward: 



Ett 



t=o 



- if 



(1) 



where 7 € [0, 1) is a discount factor, s^^^ is the state at the time step t, and the expectation is 
taken with respect to all state-action trajectories that start in the states s^^^ and follow the 
policy TT thereafter. The states s^^^ are chosen according to a distribution (p. This optimality 
criterion assures that there exists an optimal policy vr* which is stationary and deterministic 
(Puterman, 1994). The policy is greedy with respect to the optimal value function V*, which 
is a fixed point of the Bellman equation (Bellman, 1957): 



V*{s) = max 



i2(s,a) + 7^P(s' I s,a)V*{s') 



(2) 



The Bellman equation plays a fundamental role in all dynamic programming (DP) methods 
for solving MDPs (Puterman, 1994; Bertsekas &: Tsitsiklis, 1996), including value iteration, 
policy iteration, and linear programming. The focus of this paper is on linear programming 
methods and their refinements. Briefly, it is well known that the optimal value function V* 
is a solution to the linear programming (LP) formulation (Manne, 1960): 

minimize ^ ip{s)V{s) (3) 

s 

subject to: V{s) > R{s, a) + 7 ^ P{s' \ s, a)V{s') \/ s € S,a e A; 

s' 

where V{s) represents the variables in the LP, one for each state s, and ip{s) > is a strictly 
positive weighting on the state space S. The number of constraints equals to the cardinality 
of the cross product of the state and action spaces |5 x ^|. 

Linear programming and its efficient solutions have been studied extensively in applied 
mathematics and operations research (Bertsimas & Tsitsiklis, 1997). The simplex algorithm 
is a common way of solving LPs. Its worst-case time complexity is exponential in the number 
of variables. The ellipsoid method (Khachiyan, 1979) offers polynomial time guarantees but 
it is impractical for solving LPs of even moderate size. 

The LP formulation (3) can be solved compactly by the cutting plane method (Bertsimas 
& Tsitsiklis, 1997) if its objective function and constraint space are structured. Briefly, this 
method searches for violated constraints in relaxed formulations of the original LP. In every 
step, we start with a relaxed solution V^^\ find a violated constraint given yW, add it to the 
LP, and resolve for a new vector The method is iterated until no violated constraint 

is found, so that V^^^ is an optimal solution to the LP. The approach has a potential to solve 
large structured linear programs if we can identify violated constraints efficiently (Bertsimas 
&: Tsitsiklis, 1997). The violated constraint and the method that found it are often referred 
to as a separating hyperplane and a separation oracle, respectively. 

Delayed column generation is based on a similar idea as the cutting plane method, which 
is applied to the column space of variables instead of the row space of constraints. Bender's 
and Dantzig- Wolfe decompositions reflect the structure in the constraint space and are often 
used for solving large structured linear programs. 
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3. Discrete-State Factored MDPs 

Many real-world decision problems are naturally described in a factored form. Discrete-state 
factored MDPs (Boutilier et al., 1995) allow for a compact representation of this structure. 

3.1 Factored Transition and Reward Models 

A discrete- state factored MDP (Boutilier et al., 1995) is a 4-tuple M. = (X, A, P, R), where 
X = {Xi, . . . , Xn} is a state space described by a set of state variables, A = {ai, . . . , a^} is 
a set of actions^, P(X' | X, A) is a stochastic transition model of state dynamics conditioned 
on the preceding state and action, and i? is a reward function assigning immediate payoffs to 
state-action configurations. The state of the system is completely observed and represented 
by a vector of value assignments x = (a^i, . . . , x„). We assume that the values of every state 
variable Xi are restricted to a finite domain Dom(Xj). 

Transition model: The transition model is given by the conditional probability distribu- 
tion P(X' I X,.4), where X and X' denote the state variables at two successive time steps. 
Since the complete tabular representation of P(X' | X, ,4) is infeasible, we assume that the 
transition model factors along X' as: 

n 

P(X' I X, a) = n I Par(XO, a) (4) 

1=1 

and can be described compactly by a dynamic Bayesian network (DBN) (Dean k. Kanazawa, 
1989). This DBN representation captures independencies among the state variables X and 
X' given an action a. One-step dynamics of every state variable is modeled by its conditional 
probability distribution P{X[ \ Par(Xj'), a), where Par(Xj') C X denotes the parent set of X[. 
Typically, the parent set is a subset of state variables which simplifies the parameterization 
of the model. In principle, the parent set can be extended to the state variables X'. Such an 
extension poses only few new challenges when solving the new problems efficiently (Guestrin, 
2003). Therefore, we omit the discussion on the modeling of intra-layer dependencies in this 
paper. 

Reward model: The reward model factors similarly to the transition model. In particular, 
the reward function -R(x, a) = Rji'Xj, a) is an additive function of local reward functions 
defined on the subsets X^ and A. In graphical models, the local functions can be described 
compactly by reward nodes Rj, which are conditioned on their parent sets Par(i?j) = ^jUA. 
To allow this representation, we formally extend our DBN to an influence diagram (Howard 
& Matheson, 1984). 

Example 1 (Guestrin et al., 2001) To illustrate the concept of a factored MDP, we con- 
sider a network administration problem, in which the computers are unreliable and fail. The 
failures of these computers propagate through network connections to the whole network. For 
instance, if the server Xi (Figure la) is down, the chance that the neighboring computer X2 

1. For simplicity of exposition, we discuss a simpler model, which assumes a single action variable A instead 
of the factored action space A = {^1, • . . , Am}- Our conclusions in Sections 3.1 and 3.3 extend to MDPs 
with factored action spaces (Guestrin et al., 2002). 
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Figure 1: a. Four computers in a ring topology. Direction of propagating failures is denoted 
by arrows, b. A graphical representation of factored transition and reward models 
after taking an action ai in the 4-ring topology. The future state of the server X[ 
is independent of the rest of the network because the server is rebooted. Reward 
nodes Ri and Rj {j > 2) denote the components 2xi and Xj {j > 2) of the reward 
model, c. A graphical representation of the linear value function approximation 
y^(x) = WQ + Y^f^i WiXi in the 4-ring topology. Reward nodes Hq and Hi (i > 1) 
denote the value function components wq and WiXi (i > 1). 



crashes increases. The administrator can prevent the propagation of the failures by rebooting 
computers that have already crashed. 

This network administration problem can be formulated as a factored MDP. The state of 
the network is completely observable and represented by n binary variables X = {Xi, . . . , Xn}, 
where the variable Xi denotes the state of the i-th computer: (being down) or 1 (running). 
At each time step, the administrator selects an action from the set A = {ai, . . . , On+i}- The 
action Oi (i < n) corresponds to rebooting the i-th computer. The last action a„+i is dummy. 
The transition function reflects the propagation of failures in the network and can be encoded 
locally by conditioning on the parent set of every computer. A natural m,etric for evaluating 
the performance of an administrator is the total number of running computers. This metric 
factors along the computer states Xi and can be represented compactly by an additive reward 
function: 

n 

i?(x, a) = 2xi + ^ Xj. 

The weighting of states establishes our preferences for maintaining the server Xi and work- 
stations X2, ■ ■ ■ , Xn- An example of transition and reward models after taking an action ai 
in the 4-ring topology (Figure la) is given in Figure lb. 

3.2 Solving Discrete-State Factored MDPs 

Markov decision processes can be solved by exact DP methods in polynomial time in the size 
of the state space X (Puterman, 1994). Unfortunately, factored state spaces are exponential 
in the number of state variables. Therefore, the DP methods are unsuitable for solving large 
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factored MDPs. Since a factored representation of an MDP (Section 3.1) may not guarantee 
a structure in the optimal value function or policy (Koller &; Parr, 1999), we resort to value 
function approximations to alleviate this concern. 

Value function approximations have been successfully applied to a variety of real-world 
domains, including backgammon (Tesauro, 1992, 1994, 1995), elevator dispatching (Crites 
& Barto, 1996), and job-shop scheduling (Zhang & Dietterich, 1995, 1996). These partial 
successes suggest that the approximate dynamic programming is a powerful tool for solving 
large optimization problems. 



In this work, we focus on linear value function approximation (Bellman et al., 1963; Van 
Roy, 1998): 



The approximation restricts the form of the value function to the linear combination of 

|w| basis functions /i(x), where w is a vector of optimized weights. Every basis function can 
be defined over the complete state space X, but usually is limited to a small subset of state 
variables Xj (Bellman et al., 1963; Koller &; Parr, 1999). The role of basis functions is similar 
to features in machine learning. They are often provided by domain experts, although there 
is a growing amount of work on learning basis functions automatically (Patrascu et al., 2002; 
Mahadevan, 2005; Kveton & Hauskrecht, 2006a; Mahadevan & Maggioni, 2006; Mahadevan 
et al., 2006). 

Example 2 To demonstrate the concept of the linear value function model, we consider the 
network administration problem (Example 1) and assume a low chance of a single computer 
failing. Then the value function in Figure Ic is sufficient to derive a close-to-optimal policy 
on the 4-ring topology (Figure la) because the indicator functions /i(x) = Xi capture changes 
in the states of individual computers. For instance, if the computer fails, the linear policy: 



immediately leads to rebooting it. If the failure has already propagated to the computer Xj+i, 
the policy recovers it in the next step. This procedure is repeated until the spread of the initial 
failure is stopped. 

3.3 Approximate Linear Programming 

Various methods for fitting of the linear value function approximation have been proposed 
and analyzed (Bertsekas & Tsitsiklis, 1996). We focus on approximate linear programming 
(ALP) (Schweitzer &; Seidmann, 1985), which recasts this problem as a linear program: 

minimizew ^ ^ 'i^j/i(x) (6) 





(5) 





X 



subject to: ^ Wifi{x) > -R(x, o) -|- 7 ^ P(x' | x, a) ^ Wifi{x') V x e X, a G ^; 



X' 
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where w represents the variables in the LP, V'(x) > are state relevance weights weighting 
the quahty of the approximation, and 7 P(x' | x, a) Wifi{x') is a discounted backpro- 
jection of the value function (Equation 5). The ALP formulation can be easily derived 
from the standard LP formulation (3) by substituting T^'*^(x) for F(x). The formulation is 
feasible if the set of basis functions contains a constant function /o(x) = 1. We assume that 
such a basis function is always present. Note that the state relevance weights are no longer 
enforced to be strictly positive (Section 1). Comparing to the standard LP formulation (3), 
which is solved by the optimal value function V* for arbitrary weights ip{s) > 0, a solution 
w to the ALP formulation depends on the weights tp{^)- Intuitively, the higher the weights, 
the higher the quality of the approximation in a corresponding state. 

Since our basis functions are usually restricted to subsets of state variables (Section 3.2), 
summation terms in the ALP formulation can be computed efficiently (Guestrin et al., 2001; 
Schuurmans &; Patrascu, 2002). For example, the order of summation in the backprojection 
term can be rearranged as 7 Y^,- Wi J2 , P{^'i \ x, a)fi{^':), which allows its aggregation in the 

i 

space of Xj instead of X. Similarly, a factored form of V'(x) yields an efficiently computable 
objective function (Guestrin, 2003). 

The number of constraints in the ALP formulation is exponential in the number of state 
variables. Fortunately, the constraints are structured. This results from combining factored 
transition and reward models (Section 3.1) with the linear approximation (Equation 5). As 
a consequence, the constraints can be satisfied without enumerating them exhaustively. 

Example 3 The notion of a factored constraint space is important for compact satisfaction 
of exponentially many constraints. To illustrate this concept, let us consider the linear value 
function (Example 2) on the 4-ring network administration problem (Example 1). Intuitively, 
by combining the graphical representations ofP(x.' \ x, ai), i2(x, ai) (Figure lb), andV^^x.) 
(Figure Ic), we obtain a factored model of constraint violations: 

T-(x,ai) = y-(x)-7^P(x'|x,ai)F-(x')-i?(x,ai) 

x' 

= ^Wifi{yi) - -iY^WiY^P{^i I x,ai)/i(x9 -i?(x,ai) 
i i x; 

4 

= luo + ^ WiXi - jwq - jwiP{x[ = 1 I ai)- 

i=l 

4 4 
7^u)iP(x- = 1 I Xi,Xi-i,ai) - 2x1 - ^xj. 

i=2 j=2 

for an arbitrary solution w (Figure 2a). Note that this cost function: 

4 4 

1=1 j=2 

is a linear combination of a constant (j)" inx, and univariate and bivariate functions (j)"{xi) 
and (f)^ (xi, Xi-i) . It can be represented compactly by a cost network (Guestrin et al, 2001), 
which is an undirected graph over a set of variables X. Two nodes in the graph are connected 
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(a) (b) 

Figure 2: a. A graphical representation of combining factored transition and reward models 
(Figure lb) with the linear approximation (Figure Ic). Reward nodes Gq and Gi 
(i > 1) represent the discounted backprojection terms —7^0 and —'ywix'j^ {i > 1). 
Gray regions are the cost components of the constraint space, b. A cost network 
corresponding to our factored constraint space (Figure 2a). The network captures 
pairwisc dependencies X1—X2, X2—X3, and X3—X4. The treewidth of the cost 
network is 1. 



if any of the cost terms depends on both variables. Therefore, the cost network corresponding 
to the function r^(x, ai) must contain edges X1—X2, X2—X3, and X3 — X4 (Figure 2b). 

Savings achieved by the compact representation of constraints are related to the efficiency 
of computing ax gmin:x.T^ CLi) (Guestrin, 2003). This computation can be done by variable 
elimination and its complexity increases exponentially in the width of the tree decomposition 
of the cost network. The smallest width of all tree decompositions is referred to as treewidth. 



Inspired by the factorization, Guestrin et al. (2001) proposed a variable-elimination method 
(Dechter, 1996) that rewrites the constraint space in ALP compactly. Schuurmans and Pa- 
trascu (2002) solved the same problem by the cutting plane method. The method iteratively 
searches for the most violated constraint: 



arg mm 



Y,wf /,(x,)-7^P(xnx,a)/,(xO 



— -R(x, a) 



(7) 



with respect to the solution w^*) of a relaxed ALP. The constraint is added to the LP, which 
is resolved for a new solution w(*+^) . This procedure is iterated until no violated constraint 
is found, so that w*^*) is an optimal solution to the ALP. 

The quality of the ALP formulation has been studied by de Farias and Van Roy (2003). 
Based on their work, we conclude that ALP yields a close approximation to the optimal 
value function V* if the weighted max-norm error — ^™^||cc i/L minimized. We 

return to this theoretical result in Section 5.1. 
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Theorem 1 (de Farias & Van Roy, 2003) Let w be a solution to the ALP formulation 
(6). Then the expected error of the value function can be bounded as: 



V* - V 



<^min||F*-y-|L ,,,, 



where is an Ci-norm weighted by the state relevance weights tp, L{'x.) = '^- fi{'x) is a 

Lyapunov function such that the inequality kL(x) > 7supaEp(x'|x,a)[-^(x')] holds, n G [0, 1) 
denotes its contraction factor, and \\-\\^ ^ji^ is a max-norm reweighted by the reciprocal 1/L. 

Note that the /^i-norm distance ||^* — ^ equals to the expectation E^|l/* — V'^\ over 

the state space X with respect to the state relevance weights ip- Similarly to Theorem 1, we 
utilize the £i and C^o norms in the rest of the work to measure the expected and worst-case 
errors of value functions. These norms are defined as follows. 

Definition 1 The Ci (Manhattan) and C^o (infinity) norms are typically defined as \\f\\i = 
|/(x)| and ll/lloo = maxx |/(x)|. If the state space X is represented by both discrete and 
continuous variables Xd and Xc, the definition of the norms changes accordingly: 

i = J2f l/WI 11/11^ = sup|/(x)| . (8) 



The following definitions: 

= E / V'W l/WI dxc and 11/11^^^ = supV'(x) |/(x)| (9) 



correspond to the Ci and C^o norms reweighted by a function V'(x). 



4. Hybrid Factored MDPs 

Discrete-state factored MDPs (Section 3) permit a compact representation of decision prob- 
lems with discrete states. However, real-world domains often involve continuous quantities, 
such as temperature and pressure. A sufficient discretization of these quantities may require 
hundreds of points in a single dimension, which renders the representation of our transition 
model (Equation 4) infeasible. In addition, rough and uninformative discretization impacts 
the quality of policies. Therefore, we want to avoid discretization or defer it until necessary. 
As a step in this direction, we discuss a formalism for representing hybrid decision problems 
in the domains of discrete and continuous variables. 



4.1 Factored Transition and Reward Models 

A hybrid factored MDP (HMDP) is a 4-tuplc M = (X, A, P, R), where X = {Xi, . . . , X„} is 
a state space described by state variables, A = {Ai, . . . , A^} is an action space described by 
action variables, P(X' | X, A) is a stochastic transition model of state dynamics conditioned 
on the preceding state and action, and i? is a reward function assigning immediate payoffs to 
state-action configurations.^ 

2. General state and action space MDP is an alternative term for a hybrid MDP. The term hybrid does not 
refer to the dynamics of the model, which is discrete-time. 
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P(Xi) P(X^ 1X2 = 0) P(X^|X2 = l,Xi=0) P(X^ 1 X2 = l,Xi = 1) 




X^ X2 X2 X2 

Figure 3: Transition functions for continuous variables X[ and X2 after taking an action ai 
in the 4-ring topology (Example 4). The densities are shown for extreme values 
of their parent variables Xi and X2. 



State variables: State variables are either discrete or continuous. Every discrete variable 
Xi takes on values from a finite domain Dom(Xj). Following Hauskrccht and Kveton (2004), 
we assume that every continuous variable is bounded to the [0, 1] subspacc. In general, this 
assumption is very mild and permits modeling of any closed interval on M. The state of the 
system is completely observed and described by a vector of value assignments x = (xd, xc) 
which partitions along its discrete and continuous components X£) and x^. 

Action variables: The action space is distributed and represented by action variables A. 
The composite action is defined by a vector of individual action choices a = {an, ac) which 
partitions along its discrete and continuous components ao and ac- 

Transition model: The transition model is given by the conditional probability distribu- 
tion P(X' I X, A), where X and X' denote the state variables at two successive time steps. 
We assume that this distribution factors along X' as P(X' | X, A) = OLi ^i^i I PaK^i)) 
and can be described compactly by a DBN (Dean &; Kanazawa, 1989). Typically, the parent 
set Par(Xj') C X U A is a small subset of state and action variables which allows for a local 
parameterization of the transition model. 

Parameterization of our transition model: One-step dynamics of every state variable 
is described by its conditional probability distribution P{Xl \ Par(Xj')). If X^ is a continuous 
variable, its transition function is represented by a mixture of beta distributions (Hauskrecht 
& Kveton, 2004): 

P{Xi = X I Par(XO) = ^ijPheUx I aj,l5j) (10) 

i 

where vTjj is the weight assigned to the j-th component of the mixture, and ctj = i;^>^ (Par(Xj')) 
and Pj = (/)^.(Par(X-)) are arbitrary positive functions of the parent set. The mixture of beta 
distributions provides a very general class of transition functions and yet allows closed- form 
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solutions^ to the expectation terms in HALP (Section 5). If every j3j = 1, Equation 10 turns 
into a polynomial in X'-. Due to the Weierstrass approximation theorem (Jeffreys & Jeffreys, 
1988), such a polynomial is sufficient to approximate any continuous transition density over 
X'- with any precision. If X'- is a discrete variable, its transition model is parameterized by 
|Dom(Xj')| nonnegative discriminant functions 9j = (j)fj{Par{Xl)) (Guestrin et al., 2004): 

pW=,|Pa.(x;))= (11) 

Note that the parameters aj, Pj, and 6j (Equations 10 and 11) are functions instantiated by 
value assignments to the variables Par(X'-) C XU A. We keep separate parameters for every 
state variable X- although our indexing does not reflect this explicitly. The only restriction 
on the functions is that they return valid parameters for all state-action pairs (x, a). Hence, 

I Dom(JY') I 

we assume that aj(x,a) > 0, /3j(x, a) > 0, ^j(x, a) > 0, and Ylj^i ' ^i(x, a) > 0. 

Reward model: The reward model factors similarly to the transition model. In particular, 
the reward function -R(x, a) = J2j ^ji^j^^j) is an additive function of local reward functions 
defined on the subsets and A^ . In graphical models, the local functions can be described 

compactly by reward nodes Rj, which are conditioned on their parent sets Par(/?j) =XjUAj. 
To allow this representation, we formally extend our DBN to an influence diagram (Howard 
&; Matheson, 1984). Note that the form of the reward functions Rj{xj,SLj) is not restricted. 

Optimal value function and policy: The optimal policy vr* can be defined greedily with 
respect to the optimal value function V*, which is a fixed point of the Bellman equation: 

y*(x) = sup[i?(x,a) + 7Ep(,,|x,a)[^*(x')]] (12) 
a 

= sup 

a 

Accordingly, the hybrid Bellman operator T* is given by: 

T y (x) = sup [ii(x, a) + 7Ep(x'|x,a) [V{^')] ] • (13) 

a 

In the rest of the paper, we denote expectation terms over discrete and continuous variables 
in a unified form: 

Ep(x)[/(x)] = ^ / P(x)/(x)dxc. (14) 

Example 4 (Hauskrecht &: Kveton, 2004) Continuous-state network administration is 

a variation on Example 1, where the computer states are represented by continuous variables 
on the interval between (being down) and 1 (running). At each time step, the administrator 

3. The term closed-form refers to a generally accepted set of closed-form operations and functions extended 
by the gamma and incomplete beta functions. 



i?(x,a) + 7j] / P(x'|x,a)F*(x')dx^ 



163 



KVETON, HAUSKRECHT, & GUESTRIN 



selects a single action from the set A = {oi, . . . , a„+i}. The action ai (i <n) corresponds to 
rebooting the i-th computer. The last action On+i is dummy. The transition model captures 
the propagation of failures in the network and is encoded locally by beta distributions: 



P{Xi = X I Par(X;)) = Py,^t^{x \ a,P) 



a = 20 
P = 2 

a = 2 + 13xi - 5xiE[Par(X;)] 
P = \0-2xi- 6a;jE[Par(X^)] 



a ^ i 



where the variables xi and E[Par(Xj')] denote the state of the i-th computer and the expected 
state of its parents. Note that this transition function is similar to Example 1. For instance, 
in the 4-ring topology, the modes of transition densities for continuous variables X[ and X'2 
after taking an action a\ (Figure 3): 



P{X[ \a = ai) 
P{X'2 \X2 = 0,a = ai) 



0.95 P{X'<2 I = 1, Xi = 0, o = ai) « 0.67 
0.10 P(X^ I X2 = 1, Xi = 1, a = ai) = 0.90 



equal to the expected values of their discrete counterparts (Figure lb). The reward function 
is additive: 

n 

R{x, a) =2xl + Y^ 

and establishes our preferences for maintaining the server Xi and workstations X2, . . . , X„ . 
4.2 Solving Hybrid Factored MDPs 

Value iteration, policy iteration, and linear programming are the most fundamental dynamic 
programming methods for solving MDPs (Puterman, 1994; Bertsekas & Tsitsiklis, 1996). 
Unfortunately, none of these techniques is suitable for solving hybrid factored MDPs. First, 
their complexity is exponential in the number of state variables if the variables are discrete. 
Second, the methods assume a finite support for the optimal value function or policy, which 
may not exist if continuous variables are present. Therefore, any feasible approach to solving 
arbitrary HMDPs is likely to be approximate. In the rest of the section, we review two major 
classes of methods for approximating value functions in hybrid domains. 



Grid-based approximation: Grid-based methods (Chow & Tsitsikhs, 1991; Rust, 1997) 
transform the initial state space X into a set of grid points G = {x(^\ . . . , x*^^) }. The points 
are used to estimate the optimal value function Vq on the grid, which in turn approximates 
V*. The Bellman operator on the grid is defined as (Rust, 1997): 



raF(x«) 



max 
a 



i2(xW,a) + 7> ^Poi^^^ I xW,a)F(x^^^) 



N 

i=i 



(i) I ^{i) 



(15) 



,aj = 



^'-i(xW)P(x(-'') 

(i) I 



where Pd'x.^^'^ \ x^*) 
ized by the term ^'av^ ■ 7 — z^j=i 
of the value function Vq by standard techniques for solving discrete-state MDPs 



I x^*^, a) is a transition function, which is nornial- 
The operator Tq allows the computation 
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Inputs: 

a hybrid factored MDP M = (X, A, P, R) 
basis functions /o(x), /i(x), /2(x), . . . 
initial basis function weights w^'^^ 



a set of states G = {x 



(1) 



} 



Algorithm: 

t = 

while a stopping criterion is not met 
for every state x^-'^ 

for every basis function /i(x) 

X,, = /,(x(^)) 

yj = maxa i?(x(-'\ a) + 7Ep(x/|xO),a) 
w(*+i) = (X^X)-iX^y 
t = t + l 



Outputs: 

basis function weights w'*^ 



Figure 4: Pseudo-code implementation of the least-squares value iteration (£2 VI) with the 
linear value function approximation (Equation 5). The stopping criterion is often 



based on the number of steps or the £2-norm error 



At) 



At) 



measured 



on the set G. Our discussion in Sections 5.2 and 6 provides a recipe for an efficient 



implementation of the backup operation T*V'"^^\'x 



Rust (1997) analyzed the convergence of these methods for random and pseudo-random 
samples. Clearly, a uniform discretization of increasing precision guarantees the convergence 
of Vq to V* but causes an exponential blowup in the state space (Chow &: Tsitsiklis, 1991). 
To overcome this concern, Munos and Moore (2002) proposed an adaptive algorithm for non- 
uniform discretization based on the Kuhn triangulation. Ferns et al. (2005) analyzed metrics 
for aggregating states in continuous-state MDPs based on the notion of bisimulation. Trick 
and Zin (1993) used linear programming to solve low-dimensional problems with continuous 
variables. These continuous variables were discretized manually. 

Parametric value function approximation: An alternative approach to solving factored 
MDPs with continuous-state components is the approximation of the optimal value function 
V* by some parameterized model (Bertsekas &; Tsitsiklis, 1996; Van Roy, 1998; Gordon, 
1999). The parameters A are typically optimized iteratively by applying the backup operator 
T* to a finite set of states. The least-squares error \\V^ — 7~*V^''^||2 is a commonly minimized 
error metric (Figure 4). Online updating by gradient methods (Bertsekas Sz Tsitsiklis, 1996; 
Sutton & Barto, 1998) is another way of optimizing value functions. The limitation of these 
techniques is that their solutions are often unstable and may diverge (Bertsekas, 1995). On 
the other hand, they generate high-quality approximations. 
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Parametric approximations often assume fixed value function models. However, in some 
cases, it is possible to derive flexible forms of that combine well with the backup operator 
T*. For instance, Sondik (1971) showed that convex picccwisc linear functions are sufficient 
to represent value functions and their DP backups in partially-observable MDPs (POMDPs) 
(Astrom, 1965; Hauskrecht, 2000). Based on this idea, Feng et al. (2004) proposed a method 
for solving MDPs with continuous variables. To obtain full DP backups, the value function 
approximation is restricted to rectangular piecewise linear and convex (RPWLC) functions. 
Further restrictions are placed on the transition and reward models of MDPs. The advantage 
of the approach is its adaptivity. The major disadvantages are restrictions on solved MDPs 
and the complexity of RPWLC value functions, which may grow exponentially in the number 
of backups. As a result, without further modifications, this approach is less likely to succeed 
in solving high-dimensional and distributed decision problems. 



5. Hybrid Approximate Linear Programming 

To overcome the limitations of existing methods for solving HMDPs (Section 4.2), we extend 
the discrete-state ALP (Section 3.3) to hybrid state and action spaces. We refer to this novel 
framework as hybrid approximate linear programming (HALF). 

Similarly to the discrete-state ALP, HALP optimizes the linear value function approxi- 
mation (Equation 5). Therefore, it transforms an initially intractable problem of computing 
V* in the hybrid state space X into a lower dimensional space of w. The HALP formulation 
is given by a linear program^: 

minimizCw Wiai (16) 

i 

subject to: ^ WiFi{yi, a) — i?(x, a) > V x G X, a G A; 

i 

where w represents the variables in the LP, aj denotes basis function relevance weight: 

ai = E^(,)[/,(x)] (17) 
= Y.f V'(x)/i(x)dxc, 



tp{x.) > is a state relevance density function that weights the quality of the approximation, 
and -Fi(x, a) = /i(x) — ^gi{x, a) denotes the difference between the basis function /j(x) and 
its discounted backprojection: 



gi{x,a) = Ep(x'|x,a)[/i(x')] 

= J2 /^^'(x'|x,a)/,(x')dx'c,. 



(18) 



4. More precisely, the HALP formulation (16) is a linear semi-infinite optimization problem with an infinite 
number of constraints. The number of basis functions is finite. For brevity, we refer to this optimization 
problem as linear programming. 



166 



Solving Factored MDPs with Hybrid State and Action Variables 



Vectors xd (x^) and xc (x^) are the discrete and continuous components of value assign- 
ments X (x') to all state variables X (X'). The linear program can be rewritten compactly: 

minimizcw E^[V^] (19) 
subject to: - T*V^ > 

by using the Bellman operator T*. 

The HALP formulation reduces to the discrete-state ALP (Section 3.3) if the state and 
action variables are discrete, and to the continuous-state ALP (Hauskrecht Sz Kveton, 2004) 
if the state variables are continuous. The formulation is feasible if the set of basis functions 
contains a constant function /o(x) = 1. We assume that such a basis function is present. 

In the rest of the paper, we address several concerns related to the HALP formulation. 
First, we analyze the quality of this approximation and relate it to the minimization of the 
max-norm error \\V* — V^\\^, which is a commonly-used metric (Section 5.1). Second, we 
present rich classes of basis functions that lead to closed-form solutions to the expectation 
terms in the objective function and constraints (Equations 17 and 18). These terms involve 
sums and integrals over the complete state space X (Section 5.2), and therefore are hard to 
evaluate. Finally, we discuss approximations to the constraint space in HALP and introduce 
a framework for solving HALP formulations in a unified way (Section 6). Note that complete 
satisfaction of this constraint space may not be possible since every state-action pair (x, a) 
induces a constraint. 



5.1 Error Bounds 

The quality of the ALP approximation (Section 3.3) has been studied by de Farias and Van 
Roy (2003). We follow up on their work and extend it to structured state and action spaces 
with continuous variables. Before we proceed, we demonstrate that a solution to the HALP 
formulation (16) constitutes an upper bound on the optimal value function V*. 

Proposition 1 Let w be a solution to the HALP formulation (16). Then V"" > V*. 

This result allows us to restate the objective E^[y*^] in HALP. 

Proposition 2 Vector w is a solution to the HALP formulation (16): 

minimizcw \y^] 
subject to: - TV^ > 



if and only if it solves: 



minimizew 1 1 ^* ~ 1 1 1 ^ 
subject to: - TV^ > 0; 



where is an Ci-norm weighted by the state relevance density function and T* is the 

hybrid Bellman operator. 
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Based on Proposition 2, we conclude that HALP optimizes the linear value function approxi- 
mation with respect to the reweighted >Ci-norm error — ^- The following theorem 
draws a parallel between minimizing this objective and max-norm error \\V* — V^\\^. More 
precisely, the theorem says that HALP yields a close approximation to the optimal value 
function V* if V* is close to the span of basis functions /i(x). 

Theorem 2 Letw be an optimal solution to the HALP formulation (16). Then the expected 
error of the value function can he hounded as: 



V* - 



< -^minllF* -y^ll , 
i,V> ~ 1-7 w " 



where ^ is an Ci-norm weighted by the state relevance density function tp and \\-\\^ is a 
max-norm. 

Unfortunately, Theorem 2 rarely yields a tight bound on \\V* — V^W^ ^. First, it is hard to 
guarantee a uniformly low max-norm error ||V^* — V^^Hqq if dimensionality of a problem 
grows but the basis functions /i(x) are local. Second, the bound ignores the state relevance 
density function V'(x) although this one impacts the quality of HALP solutions. To address 
these concerns, we introduce non-uniform weighting of the max-norm error in Theorem 3. 

Theorem 3 Letw be an optimal solution to the HALP formulation (16). Then the expected 
error of the value function can he hounded as: 



V* - V 



2E^[L] . 
< — - — mm \/ —V\\^T/T. 

1,-4,- 1-K w " lloo,l/L. 



where ^ is an C\-norm weighted by the state relevance density i/j, L(x) = fi{x) is a 

Lyapunov function such that the inequality kL(x) > 7SupaEp(x'|x,a)[-^(x')] holds, k G [0, 1) 
denotes its contraction factor, and \\-\\^ is a max-norm reweighted by the reciprocal 1/L. 

Note that Theorem 2 is a special form of Theorem 3 when L(x) = 1 and k = 7. Therefore, 
the Lyapunov function L{x) permits at least as good bounds as Theorem 2. To make these 
bounds tight, the function L{x.) should return large values in the regions of the state space, 
which are unimportant for modeling. In turn, the reciprocal 1/L(x) is close to zero in these 
undesirable regions, which makes their impact on the max-norm error \\V* — V'^W^ -yjj^ less 
likely. Since the state relevance density function ■0(x) reflects the importance of states, the 
term \L\ should remain small. These two factors contribute to tighter bounds than those 
by Theorem 2. 

Since the Lyapunov function L(x) = w!^fi(x) lies in the span of basis functions /i(x). 
Theorem 3 provides a recipe for achieving high-quality approximations. Intuitively, a good 
set of basis functions always involves two types of functions. The first type guarantees small 
errors |y*(x) — y^(x)| in the important regions of the state space, where the state relevance 
density V'(x) is high. The second type returns high values where the state relevance density 
^(x) is low, and vice versa. The latter functions allow the satisfaction of the constraint space 
yw y q-*yvi -j^ unimportant regions of the state space without impacting the optimized 
objective function \\V* — V^'^-^ ^. Note that a trivial value function y^(x) = (1 — 7)^^-Rmax 
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satisfies all constraints in any HALP but unlikely leads to good policies. For a comprehensive 
discussion on selecting appropriate ip{'x.) and L{-x.), refer to the case studies of de Farias and 
Van Roy (2003). 

Our discussion is concluded by clarifying the notion of the state relevance density V'(x). 
As demonstrated by Theorem 4, its choice is closely related to the quality of a greedy policy 
for the value function V"" (de Farias & Van Roy, 2003). 



Theorem 4 Letw be an optimal solution to the HALP formulation (16). Then the expected 
error of a greedy policy: 



u(x) = argsup i?(x, a) + 7Ep(x/|x,a) 



can he hounded as: 



ll^*-^'^lli,.< 



1-7 



V* - 



where ||-||i ^ and |Hli ^^^^ are weighted Ci-norms, is a value function for the greedy policy 
u, and ^u,v is the expected frequency of state visits generated by following the policy u given 
the initial state distribution v. 



Based on Theorem 4, we may conclude that the expected error of greedy policies for HALP 
approximations is bounded when ip = fiu,u- Note that the distribution iJ,u,u is unknown when 
optimizing because it is a function of the optimized quantity itself. To break this cycle, 
de Farias and Van Roy (2003) suggested an iterative procedure that solves several LPs and 
adapts iJ,u,u accordingly. In addition, real-world control problems exhibit a lot of structure, 
which permits the guessing of fiu.u- 

Finally, it is important to realize that although our bounds (Theorems 3 and 4) build a 
foundation for better HALP approximations, they can be rarely used in practice because the 
optimal value function V* is generally unknown. After all, if it was known, there is no need 
to approximate it. Moreover, note that the optimization of \\V* — ^^||oo i/l (Theorem 3) is 
a hard problem and there are no methods that would minimize this error directly (Patrascu 
et al., 2002). Despite these facts, both bounds provide a loose guidance for empirical choices 
of basis functions. In Section 7, we use this intuition and propose basis functions that should 
closely approximate unknown optimal value functions V*. 



5.2 Expectation Terms 

Since our basis functions are often restricted to small subsets of state variables, expectation 
terms (Equations 17 and 18) in the HALP formulation (16) should be efficiently computable. 
To unify the analysis of these expectation terms, E^(x)[/i(x)] and Ep(-x'|x,a)[/i(x')]i show 
that their evaluation constitutes the same computational problem Ep^x) [/i(x)]) where -P(x) 
denotes some factored distribution. 

Before we discuss expectation terms in the constraints, note that the transition function 
P(x' I X, a) is factored and its parameterization is determined by the state-action pair (x, a). 
We keep the pair (x, a) fixed in the rest of the section, which corresponds to choosing a single 
constraint (x, a). Based on this selection, we rewrite the expectation terms Ep(x'|x,a)[/i(^')] 
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in a simpler notation Ep(x') > where P(x') = P(x' | x, a) denotes a factored distribution 

with fixed parameters. 

We also assume that the state relevance density function ^{x.) factors along X as: 

n 

V'(x) = JJVila^i), (20) 

i=l 

where ijji{xi) is a distribution over the random state variable Xj. Based on this assumption, 
we can rewrite the expectation terms E^(x)[/i(x)] in the objective function in a new notation 
Ep(x)[/i(x)], where P(x) = ^(x) denotes a factored distribution. In line with our discussion 
in the last two paragraphs, efficient solutions to the expectation terms in HALP are obtained 
by solving the generalized term Ep(x)[/i(x)] efficiently. We address this problem in the rest 
of the section. 

Before computing the expectation term Ep(x)[/i(x)] over the complete state space X, we 
recall that the basis function /i(x) is defined on a subset of state variables Xj. Therefore, we 
may conclude that Ep(x) [/i(x)] = Ep(x.) [/?(xi)] , where P(xj) denotes a factored distribution 
on a lower dimensional space Xj. If no further assumptions are made, the local expectation 
term Ep(xj) [/i(xi)] may be still hard to compute. Although it can be estimated by a variety 
of numerical methods, for instance Monte Carlo (Andrieu et al., 2003), these techniques are 
imprecise if the sample size is small, and quite computationally expensive if a high precision 
is needed. Consequently, wc try to avoid such an approximation step. Instead, we introduce 
an appropriate form of basis functions that leads to closed-form solutions to the expectation 
term Ep(x.)[/i(xj)]. 

In particular, let us assume that every basis function /i(xj) factors as: 

Mxi) = fioi^to) fic i^ic ) (21) 

along its discrete and continuous components fij^i^ij-^) and fi^{xi^), where the continuous 
component further decouples product: 

fici^ic)= n (22) 

of univariate basis function factors fij{xj). Note that the basis functions remain multivariate 
despite the two independence assumptions. We make these presumptions for computational 
purposes and they are relaxed later in the section. 

Based on Equation 21, we conclude that the expectation term: 

Ep(x,)[/*(^*)] = Ep(x^)[/j^(xj^)/jp(xj^)] 

= Ep(x,^)[/i^(xi^)]Ep(x,^)[/i^(xi^)] (23) 

decomposes along the discrete and continuous variables Xj^ and Xj^, where Xj = (xj^,Xj^) 
and -P(xj) = P(xj^)P(xi^). The evaluation of the discrete part Ep(x.^)[/i£,(xi^)] requires 
aggregation in the subspace Xj^ : 

Ep(x,^)[/^,(x.,)] = ^P(x,,)/,,(x,,), (24) 
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f poly ('<2 ) f beta (^2 ) ^ pwl (><2 ) 




0.5 1 0.5 1 0.5 1 



X2 ^2 X2 

Figure 5: Expectation of three basis functions /(X2) (Example 5) with respect to the transi- 
tion function P{X2 \ X2 = 1, Xi = 0, a = ai) from Figure 3. Every basis function 
f{x2) is depicted by a thick black line. The transition function is shown in a light 
gray color. Darker gray lines represent the values of the product P{x2 \ x, ai)/(x2). 
The area below corresponds to the expectation terms ^p(x'2\y.,ai)[f 



which can be carried out efficiently in Odlx gx |I-^om(Xj)|) time (Section 3.3). Following 
Equation 22, the continuous term ^p{^ )[/io(^ic)] decouples as a product: 



n 



(25) 



Xj GXjp 



where 'Ep(^,j..-^[fij{xj)] represents the expectation terms over individual random variables Xj. 
Consequently, an efficient solution to the local expectation term Ep(x.)[/j(xj)] is guaranteed 
by efficient solutions to its univariate components Ep(^x.^[fij{xj)]. 

In this paper, we consider three univariate basis function factors fij{xj): piecewise linear 
functions, polynomials, and beta distributions. These factors support a very general class of 
basis functions and yet allow closed- form solutions to the expectation terms Ep(^j..-^[fij{xj)]. 
These solutions are provided in the following propositions and demonstrated in Example 5. 

Proposition 3 (Polynomial basis functions) Let: 

P{x) = Phetaix I a,l3) 

be a beta distribution over X and: 

f{x) = x^{i-xr 

be a polynomial in x and {\ — x). Then Ep(-j.)[/(a;)] has a closed-form solution: 

p r.^^M- r(a + /3) T{a + nW + m) 
WJ - r(a)r(/3) T{a + (5 + n + m)- 
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Corollary 1 (Beta basis functions) Let: 

P{x) = Pbeta(x I a, (3) 
f{x) = Phets.{x I a/,/?/) 

be beta distributions over X. Then Ep(2;) [/(x)] has a closed-form solution: 

T{a + T{af + (3f) r{a + a/ - l)r(/3 + /?/ - 1) 



Ep(.)[/(x)] 



r(a)r(/3) r(a^)r(/3j) r(a + + ^ + - 2) 



Proof: A direct consequence of Proposition 3. Since integration is a distributive operation, 
our claim straightforwardly generalizes to the mixture of beta distributions P{x). □ 

Proposition 4 (Piecewise linear basis functions) Let: 

P{x) = -Pbeta(a; I a,/3) 

be a beta distribution over X and: 

i 

he a piecewise linear (PWL) function in x, where 1[/ ,,.,](,t) represents the indicator function 
of the interval [li,ri\. Then Ep(2,)[/(x)] has a closed-form solution: 



Ep(.)[/(^)] = E 



^i^^i^^in) - F+ik)) + biiFin) - F{k)) 

a-\- p 



where F{u) = -Fbcta('w I /3) and F'^{u) = -Fbeta('" I Q; + 1, /0) denote the cumulative density 
functions of beta distributions. 

Example 5 Efficient closed-form solutions to the expectation terms in HA LP are illustrated 
on the 4-ring network administration problem (Example 4) with three hypothetical univariate 
basis functions: 

/poly (2^2) ~ -^2 

/beta(^2) ~ -Pbeta(a^2 I 2,6) 

/pwi(4) = 1 [0.3,0.5] (4) - 1-5) + 1(0.5,0.7] (4)(-54 + 3.5) 

Suppose that our goal is to evaluate expectation terms in a single constraint that corresponds 
to the network state x= (0, 1, 0, 0) and the administrator rebooting the server. Based on these 
assumptions, the expectation terms in the constraint (x, ai) simplify as: 

Ep(x'|x,oi) [/(4)] = Ep(a;yx,ai) [/(4)] > 

where the transition function P{x2 \ x, ai) is given by: 

P{x2 I x,ai) = P{X!^ = X2 I X2 = = 0,a = ai) 

= Pbeta(4 I 15,8). 
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Closed-form solutions to the simplified expectation terms Ep(j.^|x_a^)[/(iC2)] are computed as: 



Ep(4|x,ai) [/poly(a;2)] 
(Proposition 3) 

Ep(a;^|x,ai) [/beta (3:^2)] 
(Corollary 1) 



Pbeta(4 I 15, 8)X2^ da;2 



r(i5 + 8) r(i.5 + 4)r(8) 
r(i5)r(8) r(i5 + 8 + 4) 
0.20 



E 



P(a;2|x,ai) 



^beta(4 I 15,8)Pbeta(4 I 2, 6) dx'2 

r(i5 + 8) r(2 + 6) r(i5 + 2 - i)r(8 + 6-1) 
r(i5)r(8) r(2)r(6) r(i5 + 2 + 8 + 6-2) 

0.22 

[/pwK^)] = / ^beta(4 I 15,8)l[o.3,0.5](4)(54 - l-5)dX2 + 

/ Pbeta{x2 I 15, 8)l[o.5,o.7] {x2)i-5x2 + 3.5) da;2 



(Proposition 4) 



15 



15 + 8 

5^ 

^5 + 8 

0.30 



(F+(0.5) - F+(0.3)) - 1.5(F(0.5) - -F(0.3))- 
(F+(0.7) - F+(0.5)) + 3.5(F(0.7) - F{0.5)) 



where F{u) = -Fbeta(^ I 15,8) and F~^{u) = -Fbcta(^ I 15 + 1,8) denote the cumulative density 
functions of beta distributions. A graphical interpretation of these computations is presented 
in Figure 5. Brief inspection verifies that the term Ep(2.^|x_aj)[/pwi(a^2)] indeed the largest 
one. 

Up to this point, we obtained efficient closed- form solutions for factored basis functions and 
state relevance densities. Unfortunately, the factorization assumptions in Equations 20, 21, 
and 22 are rarely justified in practice. In the rest of the section, we show how to relax them. 
In Section 6, we apply our current results and propose several methods that approximately 
satisfy the constraint space in HALP. 

5.2.1 Factored State Relevance Density Functions 

Note that the state relevance density function '^(x) is very unlikely to be completely factored 
(Section 5.1). Therefore, the independence assumption in Equation 20 is extremely limiting. 
To relax this assumption, we approximate V'(x) by a linear combination i/^'^(x) = Lj£ip£{x.) 



of factored state relevance densities ipei'^) 
in the objective function decompose as: 



niLi i^eii^i)- As a result, the expectation terms 



Ev>'^(x)[/i(x)] = E' 



^a;^E^^(x)[/i(x)], 



(26) 
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where the factored terms E^^(x) can be evaluated efficiently (Equation 23). Moreover, 

if we assume the factored densities are polynomials, their linear combination ^p^{x) is a 

polynomial. Due to the Weierstrass approximation theorem (Jeffreys & Jeffreys, 1988), this 
polynomial is sufficient to approximate any state relevance density ^(x) with any precision. 
It follows that the linear combinations permit state relevance densities that reflect arbitrary 
dependencies among the state variables X. 

5.2.2 Factored Basis Functions 

In line with the previous discussion, note that the linear value function V^^k) = J2i Wifi{^) 
with factored basis functions (Equations 21 and 22) is sufficient to approximate the optimal 

value function V* within any max-norm error ^V* — V^^||oo- Based on Theorem 2, we know 
that the same set of basis functions guarantees a bound on the /!i-norm error ||^* — ^^||;^ ^• 
Therefore, despite our independence assumptions (Equations 21 and 22), we have a potential 
to obtain an arbitrarily close HALF approximation to V* . 

6. Constraint Space Approximations 

An optimal solution w to the HALF formulation (16) is determined by a finite set of active 
constraints at a vertex of the feasible region. Unfortunately, identification of this active set 
is a hard computational problem. In particular, it requires searching through an exponential 
number of constraints, if the state and action variables are discrete, and an infinite number 
of constraints, if any of the variables are continuous. As a result, it is in general infeasible to 
find the optimal solution w to the HALF formulation. Hence, we resort to approximations 
to the constraint space in HALF whose optimal solution w is close to w. This notion of an 
approximation is formalized as follows. 

Definition 2 The HALP formulation is relaxed.- 



if only a subset C of its constraints is satisfied. 

The HALF formulation (16) can be solved approximately by solving its relaxed formulations 
(27). Several methods for building and solving these approximate LFs have been proposed: 
Monte Carlo sampling of constraints, (Hauskrecht &: Kveton, 2004), e-grid discretization of 
the constraint space (Guestrin et al., 2004), and an adaptive search for a violated constraint 
(Kveton &; Hauskrecht, 2005). In the remainder of this section, we introduce these methods. 
From now on, we denote optimal solutions to the complete and relaxed HALF formulations 
by the symbols w and w, respectively. 

Before we proceed, note that while is an upper bound on the optimal value function 
V* (Figure 6a), the relaxed value function docs not have to be (Figure 6b). The reason 
is that the relaxed HALF formulation does not guarantee that the constraint > 7~*y^ is 
satisfied for all states x. As a result, we cannot simply use Froposition 1 to prove > V*. 



minimize, 



w 




(27) 
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V* and V* V* and V* w and w 




(a) (b) (c) 

Figure 6: a. Graphical relation between the value function V* and its HALF approximation 
V^. The function is guaranteed to be an upper bound on V*. b. The relaxed 
HALF approximation may not lead to an upper bound, c. Graphical relation 
between the optimal and relaxed solutions w and w. The feasible regions of the 
complete and relaxed HALF formulations are shown in dark and light gray colors. 
The value function approximations and are typically nonlinear in the state 
space X but always linear in the space of parameters w. 



Furthermore, note that the inequality [V^] < [V^] always holds because the optimal 
solution w is feasible in the relaxed HALF (Figure 6c). These observations become helpful 
for understanding the rest of the section. 

6.1 MC-HALP 

In the simplest case, the constraint space in HALF can be approximated by its Monte Carlo 
(MC) sample. In such a relaxation, the set of constraints C is selected with respect to some 
proposal distribution ip over state-action pairs (x, a). Since the set C is finite, it establishes 
a relaxed formulation (27), which can be solved by any LF solver. An algorithm that builds 
and satisfies relaxed MC-HALF formulations is outlined in Figure 7. 

Constraint sampling is easily applied in continuous domains and its space complexity is 
proportional to the number of state and action components. Hauskrccht and Kveton (2004) 
used it to solve continuous-state factored MDFs and further refined it by heuristics (Kveton 
Sz Hauskrecht, 2004). In discrete-state domains, the quality of the sampled approximations 
was analyzed by de Farias and Van Roy (2004). Their result is summarized by Theorem 5. 

Theorem 5 (de Farias & Van Roy, 2004) Letw be a solution to the ALP form.ulation 
(6) andw be a solution to its relaxed formulation whose constraints are sampled with respect 
to a proposal distribution ip over state-action pairs (x, a). Then there exist a distribution ip 
and sample size: 

N>0(^(Kln^+lnl]] 
- V(l-7)eV (l-7)e Sj) 
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Inputs: 

a hybrid factored MDP M = (X, A, P, R) 
basis functions /o(x), /i(x), /2(x), . . . 
a proposal distribution (p 

Algorithm: 

initialize a relaxed HALP formulation with an empty set of constraints 

t = 

while a stopping criterion is not met 
sample (x, a) ~ 

add the constraint (x, a) to the relaxed HALP 
t = t+l 

solve the relaxed MC-HALP formulation 

Outputs: 

basis function weights w 



Figure 7: Pseudo-code implementation of the MC-HALP solver. 
such that with probability at least 1 — S: 



V* - 


< 1 


V* - V"" 









where is an Ci-norm weighted by the state relevance weights ip, 9 is a problem- specific 

constant, A and K denote the numbers of actions and basis functions, and e and 5 are scalars 
from the interval (0, 1). 

Unfortunately, proposing a sampling distribution that guarantees this polynomial bound 
on the sample size is as hard as knowing the optimal policy tt* (de Farias & Van Roy, 2004). 
This conclusion is parallel to those in importance sampling. Note that uniform Monte Carlo 
sampling can guarantee a low probability of constraints being violated but it is not sufficient 
to bound the magnitude of their violation (de Farias & Van Roy, 2004). 

6.2 e-HALP 

Another way of approximating the constraint space in HALP is by discretizing its continuous 
variables Xc and Ac on a uniform e-grid. The new discretized constraint space preserves its 
original factored structure but spans discrete variables only. Therefore, it can be compactly 

satisfied by the methods for discrete-state ALP (Section 3.3). An algorithm that builds and 
satisfies relaxed e-HALP formulations is outlined in Figure 8. Note that the new constraint 
space involves exponentially many constraints 0([l/e -|- 1] l-^c^l+'-^c-l) in the number of state 
and action variables Xc and A^. 

6.2.1 Error Bounds 

Recall that the e-HALP formulation approximates the constraint space in HALP by a finite 
set of equally-spaced grid points. In this section, we study the quality of this approximation 
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Inputs: 

a hybrid factored MDP M = (X, A, P, R) 
basis functions /o(x), /i(x), /2(x), . . . 
grid resolution e 



Algorithm: 

discretize continuous variables Xc and Ac into [1/e + 1] equally-spaced values 

identify subsets Xj and A^ (Xj and Aj) corresponding to the domains of Fi{x,a.) (i?j(x, a)) 

evaluate Fj(xj,£ij) {Rj{xj,aj)) for all configurations Xj and {xj and a^) on the e-grid 

calculate basis function relevance weights 

solve the relaxed e-HALP formulation (Section 3.3) 

Outputs: 

basis function weights w 



Figure 8: Pseudo-code implementation of the sr-HALP solver. 



and bound it in terms violating constraints in the complete HALP. More precisely, we prove 
that if a relaxed HALP solution w violates the constraints in the complete HALP by a small 
amount, the quality of the approximation is close to V". In the next section, we extend 
this result and relate to the grid resolution s. Before we proceed, we quantify our notion 
of constraint violation. 

Definition 3 Let w be an optimal solution to a relaxed HALP formulation (27). The vector 
w is (^-infeasible if: 

yv> _ ^yw > (28) 

where T* is the hybrid Bellman operator. 

Intuitively, the lower the (5-infeasibility of a relaxed HALP solution w, the closer the quality 
of the approximation V" to V" . Proposition 5 states this intuition formally. In particular, 
it says that the relaxed HALP formulation leads to a close approximation to the optimal 
value function V* if the complete HALP does and the solution w violates its constraints by 
a small amount. 



Proposition 5 Let w he an optimal solution to the HALP formulation (16) and w he an 
optimal 6-infeasible solution to its relaxed formulation (27). Then the expected error of the 
value function can he hounded as: 

25 



V* - 


< 


Y* _ 


+ 




1,1/) 




i,i/> 



1-7' 



where 



11,1/. 



is an C\-norm weighted by the state relevance density function tp. 



Based on Proposition 5, we can generalize our conclusions from Section 5.1 to relaxed HALP 
formulations. For instance, we may draw a parallel between optimizing the relaxed objective 



El/) [V^^] and the max- norm error — 



loo,l/L' 
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Theorem 6 Let w be an optimal 6-infeasible solution to a relaxed HALP formulation (27). 
Then the expected error of the value function can be bounded as: 



V* - V 



where is an C\-norm weighted by the state relevance density i/j, L{yi) = '^- wf fi{yi) is a 

Lyapunov function such that the inequality nLix) > 7 sup^ Ep(x'|x,a) [^(^')] holds, k e [0, 1) 
denotes its contraction factor, and \\-\\^ is a max-norm reweighted by the reciprocal 1/L. 

Proof: Direct combination of Theorem 3 and Proposition 5. □ 
6.2.2 Grid Resolution 

In Section 6.2.1, we bounded the error of a relaxed HALP formulation by its (5-infeasibility 
(Theorem 6), a measure of constraint violation in the complete HALP. However, it is unclear 
how the grid resolution e relates to (5-infeasibility. In this section, we analyze the relationship 
between e and S. Moreover, we show how to exploit the factored structure in the constraint 
space to achieve the 5-infeasibility of a relaxed HALP solution w efficiently. 

First, let us assume that w is an optimal (5-infeasible solution to an e-HALP formulation 
and Z = XU A is the joint set of state and action variables. To derive a bound relating both 
e and 5, we assume that the magnitudes of constraint violations t^(z) = WiFi{z) — R{z) 
are Lipschitz continuous. 

Definition 4 The function /(x) is Lipschitz continuous if: 

|/(x)-/(xO| <K||x-x'||^ Vx,x'gX; (29) 
where K is referred to as a Lipschitz constant. 

Based on the e-grid discretization of the constraint space, we know that the distance of any 
point z to its closest grid point xq = argmiuz/ ||z — z'||^ is bounded as: 

I|z-zg|Ioo<|- (30) 

Prom the Lipschitz continuity of r^(z), we conclude: 

r-(zG) - r-(z)| < K ||zg - z||^ < ^. (31) 

Since every constraint in the relaxed e-HALP formulation is satisfied, t^{zg) is nonnegative 
for all grid points zq- As a result, Equation 31 yields t^(z) > —Ke/2 for every state-action 
pair z = (x, a). Therefore, based on Definition 3, the solution w is (5-infeasible for 6 > Ke/2. 
Conversely, the J-infeasibility of w is guaranteed by choosing e < 26/ K. 

Unfortunately, K may increase rapidly with the dimensionality of a function. To address 
this issue, we use the structure in the constraint space and demonstrate that this is not our 
case. First, we observe that the global Lipschitz constant i^giob is additive in local Lipschitz 
constants that correspond to the terms WiFi{z) and Rj{z). Moreover, Kgi^b < NKi^c, where 
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Inputs: 

a hybrid factored MDP M = (X, A, P, R) 
basis functions /o(x), /i(x), /2(x), . . . 
initial basis function weights w^'^^ 
a separation oracle O 

Algorithm: 

initialize a relaxed HALP formulation with an empty set of constraints 
t = 

while a stopping criterion is not met 

query the oracle O for a violated constraint {xo,ao) with respect to w^*^ 
if the constraint {xo,aLo) is violated 

add the constraint to the relaxed HALP 
resolve the LP for a new vector w^*"*"^^ 
t = t+l 

Outputs: 

basis function weights w^*) 



Figure 9: Pseudo-code implementation of a HALP solver with the cutting plane method. 

N denotes the total number of the terms and Ki^c is the maximum over the local constants. 
Finally, parallel to Equation 31, the (5-infeasibility of a relaxed HALP solution w is achieved 
by the discretization: 

26 25 
e < < 

Since the factors WiFi{z) and Rj{z) are often restricted to small subsets of state and action 
variables, K\oc should change a little when the size of a problem increases but its structure is 
fixed. To prove that Ki^c is bounded, we have to bound the weights idi. If all basis functions 
axe of unit magnitude, the weights Wi are intuitively bounded as \wi\ < (1— 7)~^-Rmax, where 
-Rmax denotes the maximum one-step reward in the HMDP. 

Based on Equation 32, we conclude that the number of discretization points in a single 
dimension \l/e + 1] is bounded by a polynomial in N, Kioc, and 1/6. Hence, the constraint 
space in the relaxed e-HALP formulation involves 0{[NK\oc{l/6)]\-^^'^\^\) constraints, where 
|X| and I A| denote the number of state and action variables. The idea of variable elimination 
can be used to write the constraints compactly by 0([A^i^^ioc(l/(^)]"^^^ (|X| + | Aj)) constraints 
(Example 3), where T is the treewidth of a corresponding cost network. Therefore, satisfying 
this constraint space is polynomial in N, K\oc, 1/6, |X|, and |A|, but still exponential in T. 

6.3 Cutting Plane Method 

Both MC and e-HALP formulations (Sections 6.1 and 6.2) approximate the constraint space 
in HALP by a finite set of constraints C. Therefore, they can be solved directly by any linear 
programming solver. However, if the number of constraints is large, formulating and solving 
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Inputs: 

a hybrid factored MDP M = (X, A, P, R) 
basis functions /o(x), /i(x), /2(x), . . . 
basis function weights w 
grid resolution e 

Algorithm: 

discrctizc continuous variables Xc and Ac into ([l/e + 1]) equally-spaced values 
identify subsets X^ and A^ (Xj and Aj) corresponding to the domains of ^^(x, a) {Rj{x,a)) 
evaluate Fj(xj,ai) (i?j(xj,aj)) for all configurations Xj and cij (xj and a^) on the £-grid 
build a cost network for the factored cost function: 

T^(x, a) = J2i WiFi{x, a) - i?(x, a) 
find the most violated constraint in the cost network: 

(xo.ac)) = argminx,a-r'^(x,a) 

Outputs: 

state-action pair {xo,8lo) 



Figure 10: Pseudo-code implementation of the £-HALP separation oracle O^. 

LPs with the complete set of constraints is infeasible. In this section, we show how to build 
relaxed HALP approximations efficiently by the cutting plane method. 

The cutting plane method for solving HALP formulations is outlined in Figure 9. Briefly, 
this approach builds the set of LP constraints incrementally by adding a violated constraint 
to this set in every step. In the remainder of the paper, we refer to any method that returns 
a violated constraint for an arbitrary vector w as a separation oracle. Formally, every HALP 
oracle approaches the optimization problem: 



arg mm 
x,a 



F^(x)-7Ep(,,|,,,) y^(xO -i2(x,a) . (33) 



Consequently, the problem of solving hybrid factored MDPs efficiently reduces to the design 
of efficient separation oracles. Note that the cutting plane method (Figure 9) can be applied 
to suboptimal solutions to Equation 33 if these correspond to violated constraints. 

The presented approach can be directly used to satisfy the constraints in relaxed e-HALP 
formulations (Schuurmans & Patrascu, 2002) . Briefly, the solver from Figure 9 iterates until 
no violated constraint is found and the e-HALP separation oracle (Figure 10) returns the 
most violated constraint in the discretized cost network given an intermediate solution w^*) . 
Note that although the search for the most violated constraint is polynomial in |X| and | A| 
(Section 6.2.2), the running time of our solver does not have to be (Guestrin, 2003). In fact, 
the number of generated cuts is exponential in |X| and |A| in the worst case. However, the 
same oracle embedded into the ellipsoid method (Khachiyan, 1979) yields a polynomial-time 
algorithm (Bertsimas & Tsitsiklis, 1997). Although this technique is impractical for solving 
large LPs, we may conclude that our approach is indeed polynomial-time if implemented in 
this particular way. 

Finally, note that searching for the most violated constraint (Equation 33) has applica- 
tion beyond satisfying the constraint space in HALP. For instance, computation of a greedy 
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policy for the value function V^: 



arg max 
a 



arg nun 
a 



i?(x,a)+7Ep(x'|x,a)[v^^(x') 



(34) 



is almost an identical optimization problem, where the state variables X are fixed. Moreover, 
the magnitude of the most violated constraint is equal to the lowest 6 for which the relaxed 
HALP solution w is (5-infeasible (Equation 28): 



S = min 



y^(x) — max 



i?(x,a) +7Ep(x'|x,a) 



mm 

x,a L 



y^(x) - i?(x,a) - 7Ep(x'|x,a) ^"^(x') 



(35) 



6.4 MCMC-HALP 

In practice, both MC and e-HALP formulations (Sections 6.1 and 6.2) are built on a blindly 
selected set of constraints C. More specifically, the constraints in the MC-HALP formulation 
are chosen randomly (with respect to a prior distribution ip) while the e-HALP formulation 
is based on a uniform e-grid. This discretized constraint space preserves its original factored 
structure, which allows for its compact satisfaction. However, the complexity of solving the 
e-HALP formulation is exponential in the treewidth of its discretized constraint space. Note 
that if the discretized constraint space is represented by binary variables only, the treewidth 
increases by a multiplicative factor of log2 [1 /e + 1] , where [1 /e + 1] denotes the number of 
discretization points in a single dimension. Consequently, even if the treewidth of a problem 
is relatively small, solving its e-HALP formulation becomes intractable for small values of e. 

To address the issues of the discussed approximations (Sections 6.1 and 6.2), we propose 
a novel Markov chain Monte Carlo (MCMC) method for finding the most violated constraint 
of a relaxed HALP. The procedure directly operates in the domains of continuous variables, 
takes into account the structure of factored MDPs, and its space complexity is proportional 
to the number of variables. This separation oracle can be easily embedded into the ellipsoid 
or cutting plane method for solving linear programs (Section 6.3), and therefore constitutes 
a key step towards solving HALP efficiently. Before we proceed, we represent the constraint 
space in HALP compactly and state an optimization problem for finding violated constraints 
in this factored representation. 

6.4.1 Compact Representation of Constraints 

In Section 3.3, we showed how the factored representation of the constraint space allows for 
its compact satisfaction. Following this idea, we define violation magnitude r^(x, a): 

r-(x,a) = - [y-(x) -7Ep(,,|,,,)[F-(x')] -i?(x,a)] (36) 
= -^Wi [fi (x) - jgi (x, a)] + i?(x, a) , 

i 

which measures the amount by which the solution w violates the constraints in the complete 
HALP. We represent the magnitude of violation (x, a) compactly by an infiuence diagram 
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(ID), where X and A are decision nodes, and X' are random variables. This representation 
is built on the transition model P(X' | X, A), which is factored and captures independencies 
among the variables X, X', and A. We extend the diagram by three types of reward nodes, 
one for each term in Equation 36: Hi = —Wifi(x) for every basis function, Gi = 'ywifi{x.') for 
every backprojection, and Rj = Rj{'Xj,a.j) for every local reward function. The construction 
is completed by adding arcs that graphically represent the dependencies of the reward nodes 
on the variables. Finally, we can verify that: 



Consequently, the decision that maximizes the expected utility in the ID corresponds to the 
most violated constraint. A graphical representation of the violation magnitude r^(x, a) on 
the 4-ring network administration problem (Example 4) is given in Figure 2a. The structure 
of the constraint space is identical to Example 3 if the basis functions are univariate. 

We conclude that any algorithm for solving IDs can be applied to find the most violated 
constraint. However, most of these methods (Cooper, 1988; Jensen et al., 1994; Ortiz, 2002) 
are restricted to discrete variables. Fortunately, special properties of the ID representation 
allow its further simplification. If the basis functions are chosen conjugate to the transition 
model (Section 5.2), we obtain a closed-form solution to the expectation term Ep(x'|x,a)[G'i] 
(Equation 18), and the random variables X' are marginalized out of the diagram. The new 
representation contains no random variables and is known as a cost network (Section 3.3). 

Note that the problem of finding the most violated constraint in the ID representation is 
also identical to finding the maximum a posteriori (MAP) configuration of random variables 
in Bayesian networks (Dechter, 1996; Park & Darwiche, 2001, 2003; Yuan et al., 2004). The 
latter problem is difficult because of the alternating summation and maximization operators. 
Since we marginalized out the random variables X', we can solve the maximization problem 
by standard large-scale optimization techniques. 

6.4.2 Separation Oracle Omcmc 

To find the most violated constraint in the cost network, we apply the Metropolis-Hastings 
(MH) algorithm (Metropolis et al., 1953; Hastings, 1970) and propose a Markov chain whose 
invariant distribution converges to the vicinity of argmaxz t^(z), where z = (x, a) is a value 
assignment to the joint set of state and action variables Z = X U A. 

In short, the Metropolis-Hastings algorithm defines a Markov chain that transits between 
an existing state z and a proposed state z* with the acceptance probability: 



where q{z* \ z) and p{z) are a proposal distribution and a target density, respectively. Under 
mild restrictions on p(z) and q{z* \ z), the frequency of state visits generated by the Markov 
chain always converges to the target function p{z) (Andrieu et al., 2003). In the remainder of 
this section, we discuss the choices of p{z) and q{z* | z) to solve our optimization problem.^ 

5. For an introduction to Markov chain Monte Carlo (MCMC) methods, refer to the work of Andrieu et al. 




(37) 



J 




(38) 



(2003). 
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Target density: The violation magnitude t'*^(z) is turned into a density by the transforma- 
tion p{z) = exp[r^(z)]. Due to its monotonic character, p{z) retains the same set of global 
maxima as t^(z). Therefore, the search for argmax^ r^(z) can be done on the new function 
p{z). To prove that p(z) is a density, we demonstrate that /^^ p(z) dzc is a normalizing 
constant, where zd and zq are the discrete and continuous parts of the value assignment z. 
First, note that the integrand zc is restricted to the space [0, Ijl^c'L As a result, the integral 
J^^ p{z) dzc is proper if p{z) is bounded, and hence it is Riemann integrable and finite. To 
prove that p{z) = exp[r^(z)] is bounded, we bound the magnitude of violation t"*^(z). If all 
basis functions are of unit magnitude, the weights wi can be bounded as \wi \ < (1— 7)~''^-Rmax 
(Section 6.2.2), which in turn yields the bound |t^(z)| < (|w| (1— 7)~^ + l)-Rmax- Therefore, 
p(z) is bounded and can be treated as a density function. 

To find the mode of p{z), we employ simulating annealing (Kirkpatrick et al., 1983) and 
generate a non-homogeneous Markov chain whose invariant distribution is equal to p^/-^* (z), 
where Tt is a cooling schedule such that limt_j.oo Tt = 0. Under weak regularity assumptions 
on p{z). p°^(z) is a probability density that concentrates on the set of the global maxima of 
p(z) (Andrieu et al., 2003). If our cooling schedule Tf decreases such that Tf > c/lii{t + 1), 
where c is a problem-specific constant, the chain from Equation 38 converges to the vicinity 
of argmaxz r^(z) with the probability converging to 1 (Geman &; Geman, 1984). However, 
this logarithmic cooling schedule is slow in practice, especially for a high initial temperature 
c. To overcome this problem, we select a smaller value of c (Geman &i Geman, 1984) than is 
required by the convergence criterion. Therefore, the convergence of our chain to the global 
optimum argmaxzr^(z) is no longer guaranteed. 



Proposal distribution: We take advantage of the factored character of Z and adopt the 
following proposal distribution (Geman & Geman, 1984): 

a(z* \z)-i P^^^ I ^-'^ = f39) 

' " \ otherwise ' 

where z_j and z*_^ are value assignments to all variables but Zi in the original and proposed 
states. If Zi is a discrete variable, its conditional: 

p{z* I Z_i) = Pi^^y^^i-^^^h^i+^^-'-^^n^m) 

can be derived in a closed form. If Zi is a continuous variable, a closed form of its cumulative 
density function is unlikely to exist. To sample from the conditional, we embed another MH 
step within the original chain. In the experimental section, we use the Metropolis algorithm 
with the acceptance probability: 

where Zi and z* are the original and proposed values of the variable Zj. Note that sampling 
from both conditionals can be performed in the space of t^(z) and locally. 
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Inputs: 

a hybrid factored MDP M = (X, A, P, R) 
basis functions /o(x), /i(x), /2(x), . . . 
basis function weights w 

Algorithm: 

initialize a state-action pair z^*^ 

t = 

while a stopping criterion is not met 
for every variable Zi 
sample u ~ U[o,i] 
sample z* p{Zi \ zL*-) 



else 

update Tt+i according to the cooling schedule 

t = t+l 
(xc),ao) = zW 

Outputs: 

state-action pair (xc),ac)) 



Figure 11: Pseudocode implementation of the MCMC-HALP oracle Omcmc- The symbol 
U[o,i] denotes the uniform distribution on the interval [0, 1]. Since the testing for 
violated constraints (Figure 9) is inexpensive, our implementation of the MCMC- 
HALP solver in Section 7 tests all constraints z*^*-* generated by the Markov chain 
and not only the last one. Therefore, the separation oracle Omcmc returns more 
than one constraint per chain. 



Finally, by assuming that z*_,- = z_j (Equation 39), we derive a non-homogenous Markov 
chain with the acceptance probability: 

. /. P^/^*(z*Mz|z*)l 

A[z, z ) = mm < 1 



min < 1, 



= min < 1 



pV^t(z)g(z* I z) j 
py^*{zt\zU)p'/^^{zU)p{zi\zU) 

pVT.(^. |z_,y/T.(z_,)p(z*|z_,) 

p^/^*{z* I z_,y/^*(z_,)p(z, I z_,) ' 
py^*{zi I z^i)py^t{z-i)p{z* I z-i) 

py^*-^iz* I z_,) 

'pVT*-l(^. |Z_,) 



rnmll f^'l'yiy: }, (42) 
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which converges to the vicinity of the most violated constraint. Yuan et al. (2004) proposed 
a similar chain for finding the MAP configuration of random variables in Bayesian networks. 

6.4.3 Constraint Satisfaction 

If the MCMC-HALP separation oracle Omcmc (Figure 11) converges to a violated constraint 
(not necessarily the most violated) in polynomial time, the ellipsoid method is guaranteed to 
solve HALP formulations in polynomial time (Bertsimas k, Tsitsiklis, 1997). Unfortunately, 
convergence of our chain within arbitrary precision requires an exponential number of steps 
(Geman & Geman, 1984). Although the bound is loose to be of practical interest, it suggests 
that the time complexity of proposing violated constraints dominates the time complexity of 
solving relaxed HALP formulations. Therefore, the oracle Cmcmc should search for violated 
constraints efficiently. Convergence speedups that directly apply to our work include hybrid 
Monte Carlo (HMC) (Duane et al., 1987), Rao-Blackwellization (CaseUa k Robert, 1996), 
and slice sampling (Higdon, 1998). 

7. Experiments 

Experimental section is divided in three parts. First, we show that HALP can solve a simple 
HMDP problem at least as efficiently as alternative approaches. Second, we demonstrate the 
scale-up potential of our framework and compare several approaches to satisfy the constraint 
space in HALP (Section 6) . Finally, we argue for solving our constraint satisfaction problem 
in the domains of continuous variables without discretizing them. 

All experiments are performed on a Dell Precision 380 workstation with 3.2GHz Pentium 
4 CPU and 2GB RAM. Linear programs are solved by the simplex method in the LP_SOLVE 
package. The expected return of policies is estimated by the Monte Carlo simulation of 100 
trajectories. The results of randomized methods are additionally averaged over 10 randomly 
initialized runs. Whenever necessary, we present errors on the expected values. These errors 
correspond to the standard deviations of measured quantities. The discount factor 7 is 0.95. 

7.1 A Simple Example 

To illustrate the ability of HALP to solve factored MDPs, we compare it to £2 (Figure 4) and 
grid-based value iteration (Section 4.2) on the 4-ring topology of the network administration 
problem (Example 4). Our experiments are conducted on uniform and non-uniform grids of 
varying sizes. Grid points are kept fixed for all compared methods, which allows for their fair 
comparison. Both value iteration methods are iterated for 100 steps and terminated earlier 
if their Bellman error drops below 10~^. Both the £2 and HALP methods approximate the 
optimal value function V* by a linear combination of basis functions, one for each computer 
Xi (/j(x) =Xj), and one for every connection Xi —?■ Xj in the ring topology {fi^j{x.)=XiXj). 
We assume that our basis functions are sufficient to derive a one-step lookahead policy that 
reboots the least efficient computer. We believe that such a policy is close-to-optimal in the 
ring topology. The constraint space in the complete HALP formulation is approximated by 
its MC-HALP and e-HALP formulations (Sections 6.1 and 6.2). The state relevance density 
function ^l^{x) is uniform. Our experimental results are reported in Figure 12. 
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Figure 12: Comparison of three approaches to solving hybrid MDPs on the 4-ring topology 
of the network administration problem (Example 4) . The methods are compared 
on uniform and non-uniform grids of varying size (A'') by the expected discounted 
reward of policies and their computation time (in seconds). 



To verify that our solutions are non-trivial, we compare them to three heuristic policies: 
dummy, random, and server. The dummy policy TTdummy (x) = 05 always takes the dummy 
action 05. Therefore, it establishes a lower bound on the performance of any administrator. 
The random policy behaves randomly. The server policy 7rserver(x) = ai protects the server 
Xi. The performance of our heuristics is shown in Figure 12. Assuming that we can reboot 
all computers at each time step, a Utopian upper bound on the performance of any policy tt 
can be derived as: 



^7*i?(xt,7r(xi)) 



< maxEp(x'|x,a) 



< 



max i?(x , a 

a' 

max / 2P{x{ \ x, a)xf -|- ^ P{x'j \ x, a)xf dx' 
/ PheUx I 20,2)x'2dx' 

1 - 7 



83.0. 



(43) 



We do not analyze the quality of HALP solutions with respect to the optimal value function 
V* (Section 5.1) because this one is unknown. 

Based on our results, we draw the following conclusions. First, grid-based value iteration 
is not practical for solving hybrid optimization problems of even small size. The main reason 
is the space complexity of the method, which is quadratic in the number of grid points N. 
If the state space is discretized uniformly, N is exponential in the number of state variables. 
Second, the quality of the HALP policies is close to the £2 VI policies. This result is positive 
since £2 value iteration is commonly applied in approximate dynamic programming. Third, 
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both the £2 and HALP approaches yield better policies than grid-based value iteration. This 
result is due to the quality of our value function estimator. Its extremely good performance 
for e = 1 can be explained from the monotonicity of the reward and basis functions. Finally, 
the computation time of the C2 VI policies is significantly longer than the computation time 
of the HALP policies. Since a step of £2 value iteration (Figure 4) is as hard as formulating 
a corresponding relaxed HALP, this result comes at no surprise. 



7.2 Scale- up Potential 

To illustrate the scale-up potential of HALP, we apply three relaxed HALP approximations 
(Section 6) to solve two irrigation network problems of varying complexity. These problems 
are challenging for state-of-the-art MDP solvers due to the factored state and action spaces. 

Example 6 (Irrigation network operator) An irrigation network is a system of irriga- 
tion channels connected by regulation devices (Figure 13). The goal of an irrigation network 
operator is to route water between the channels to optimize water levels in the whole system. 
The optimal levels are determined by the type of a planted crop. For simplicity of exposition, 
we assume that all irrigation channels are oriented and of the same size. 

This optimization problem can be formulated as a factored MDP. The state of the network 
is completely observable and represented by n continuous variables X = {Xi, . . . , X„}, where 
the variable Xi denotes the water level in the i-th channel. At each time step, the irrigation 
network operator regulates m devices Ai that pump water between every pair of their inbound 
and outbound channels. The operation modes of these devices are described by discrete action 
variables A = {^1, . . . , A„i}. Inflow and outflow devices (no inbound or outbound channels) 
are not controlled and just pump water in and out of the network. 

The transition model reflects water flows in the irrigation network and is encoded locally 
by conditioning on the operation modes A: 



PiXUj = X I Par(X;^^.)) ^ PheUx \ a,/?) 



a = + 2 

^ = 46(l-/xU,-) + 2 

h 

l^i^j = Xi^j — X lai_>j^.fc {Aj) m.in{xi^j,Tj) 
k 

where Xi^j represents the water level between the regulation devices Ai and Aj, la,^^i^j (Ai 
andla^^■^f.{Aj) denote the indicator functions of water routing actions Oh^i^j and ai 



at the devices Ai and Aj, and Ti and tj are the highest tolerated flows through these devices. 
In short, this transition model conserves water mass in the network and adds some variance 
to the resulting state X[^- . The introduced indexing of state and action variables is explained 
on the 6-ring irrigation network in Figure 14a. In the rest of the paper, we assume an inflow 
of 0.1 to any inflow device Ai (ti = 0.1 ), an outflow of 1 from any outflow device Aj (Tj = 1), 
and the highest tolerated flow of 1/3 at the remaining devices A^ (t^ = 1/3). 

The reward function i?(x, a) = J2j Pjixj) is factored along individual irrigation channels 
and described by the univariate function: 



Rj{xj) = 2x 



3 
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Figure 13: Illustrations of three irrigation network topologies: a. 6-ring, b. 6-ring-of-rings, 

and c. 3 X 3 grid. Irrigation channels and their regulation devices are represented 
by arrows and rectangles. Inflow and outflow nodes are colored in light and dark 
gray. The ring and ring-of-rings networks are parameterized by the total number 
of regulation devices except for the last four (n). 



for each outflow channel (one of its regulation devices must be outflow), and by the function: 



Af{xj I 0.4, 0.025) Af{xj \ 0.55, 0.05) 



25.6 



32 



for the remaining channels (Figure 14b)- Therefore, we reward both for maintaining optimal 
water levels and pumping water out of the irrigation network. Several examples of irrigation 
network topologies are shown in Figure 13. 

Similarly to Equation 43, we derive a Utopian upper bound on the performance of any policy 
TT in an arbitrary irrigation network as: 



^7*^(xt,vr(xi)) 



< 



1-7 



0.2nin + (n - nout) 



max / Pbeta(a;' | 46x + 2, 46(1 - x) + 2)R{x') dx' 

^ Jx> 



(44) 



where n is the total number of irrigation channels, and riout denote the number of inflow 
and outflow channels, respectively, and R{x') =N{x' \ 0.4, 0.025)/25.6+A/'(x' | 0.55, 0.05)/32. 
We do not analyze the quality of HALP solutions with respect to the optimal value function 
V* (Section 5.1) because this one is unknown. 

In the rest of the section, we illustrate the performance of three HALP approximations, 
MC-HALP, £-HALP, and MCMC-HALP (Section 6), on the ring and ring-of-rings topologies 
(Figure 13) of the irrigation network problem. The constraints in the MC-HALP formulation 
are sampled uniformly at random. This establishes a baseline for all HALP approximations. 
The £-HALP and MCMC-HALP formulations are generated iteratively by the cutting plane 
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Figure 14: a. Indexing used in the description of the transition function in Example 6. The 
parameters h, i, j, and k are equal to 6, 7, 10, and 1, respectively, b. Univariate 
reward functions over water levels Xj (Example 6). c. Univariate basis functions 
over water levels Xj. 



method. The MCMC oracle Omcmc is simulated for 500 steps from the initial temperature 
c = 0.2, which leads to a decreasing cooling schedule from Tq = 0.2 to T500 ~ 0.02. These 
parameters are selected empirically to demonstrate the characteristics of the oracle Omcmc 
rather than to maximize its performance. The value function V* is approximated by a linear 
combination of four univariate piecewise linear basis functions for each channel (Figure 14c). 
Wc assume that our basis ftmctions arc sufRcicnt to derive a one-step lookahead policy that 
routes water between the channels if their water levels are too high or too low (Figure 14b). 
We believe that such a policy is close-to-optimal in irrigation networks. The state relevance 
density function ip{yi) is uniform. Our experimental results are reported in Figures 15-17. 

Based on the results, we draw the following conclusions. First, all HALF approximations 
scale up in the dimensionality of solved problems. As shown in Figure 16, the return of the 
policies grows linearly in n. Moreover, the time complexity of computing them is polynomial 
in n. Therefore, if a problem and its approximate solution are structured, we take advantage 
of this structure to avoid an exponential blowup in the computation time. At the same time, 
the quality of the policies is not deteriorating with increasing problem size n. 

Second, the MCMC solver (N = 250) achieves the highest objective values on all solved 
problems. Higher objective values are interpreted as closer approximations to the constraint 
space in HALF since the solvers operate on relaxed formulations of HALF. Third, the quality 
of the MCMC-HALP pohcies {N = 250) surpasses the MC-HALP pohcies {N = 10^) while 
both solvers consume approximately the same computation time. This result is due to the 
informative search for violated constraints in the MCMC-HALP solver. Fourth, the quality 
of the MCMC-HALP pohcies (A^ = 250) is close to the e-HALP pohcies (e = 1/16) ahhough 
there is a significant difference between their objective values. Further analysis shows that 
the shape of the value functions is similar (Figure 17) and they differ the most in the weight 
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Figure 15: Comparison of three HALP solvers on two irrigation network topologies of vary- 
ing sizes (n). The solvers are compared by the objective value of a relaxed HALP 
(OV), the expected discounted reward of a corresponding policy, and computa- 
tion time (in seconds). The e-HALP, MCMC-HALP, and MC-HALP solvers are 
parameterized by the resolution of e-grid (e), the number of MCMC chains (N), 
and the number of samples (N). Note that the quality of policies improves with 
higher grid resolution (1/e) and larger sample size {N). Upper bounds on their 
expected returns are shown in the last rows of the tables. 



of the constant basis function /o(x) = 1. Note that increasing does not affect the quality 
of a greedy policy for . However, this trick allows the satisfaction of the constraint space 
in HALP (Section 5.1). 

Finally, the computation time of the £-HALP solver is seriously affected by the topologies 
of the irrigation networks, which can be explained as follows. For a small e and large n, the 
time complexity of formulating cost networks for the ring and ring-of-rings topologies grows 
by the rates of [1/e + 1]^ and [1/e + 1]^, respectively. Since the e-HALP method consumes 
a significant amount of time by constructing cost networks, its quadratic (in |"l/e -|- 1]) time 
complexity on the ring topology worsens to cubic (in [1/e + 1] ) on the ring-of-rings topology. 
On the other hand, a similar cross-topology comparison of the MCMC-HALP solver shows 
that its computation times differ only by a multiplicative factor of 2. This difference is due 
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Figure 16: Scale-up potential of the e-HALP, MCMC-HALP, and MC-HALP solvers on two 
irrigation network topologies of varying sizes (n). The graphs show the expected 
discounted reward of policies and their computation time (in hours) as functions 
of n. The HALP solvers are parameterized by the resolution of e-grid (e = 1/16), 
the number of MCMC chains [N = 250), and the number of samples [N = 10^). 
Note that all trends can be approximated by a polynomial f(n) (gray line) with 
a high degree of confidence (the coefficient of determination i?^), where c denotes 
a constant independent of n. 



to the increased complexity of sampling p(z^ \ z_j), which results from more complex local 
dependencies in the ring-of-rings topology and not its treewidth. 

Before we proceed, note that our relaxed formulations (Figure 15) have significantly less 
constraints than their complete sets (Section 6.3). For instance, the MC-HALP formulation 
(N = 10^) on the 6 -ring irrigation network problem is originally established by 10® randomly 
sampled constraints. Based on our empirical results, the constraints can be satisfied greedily 
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Figure 17: Univariate projections V'^{x.)\xj = YlrXj=Xi "^ifii^i) of approximate value func- 
tions on the 6-ring irrigation network problem (Figure 13a). These functions 
are learned from 40 basis functions (Figure 14c) by the e-HALP, MCMC-HALP, 
and MC-HALP solvers. The solvers are parameterized by the resolution of e-grid 
(e = 1/16), the number of MCMC chains {N = 250), and the number of samples 
[N = 10®). Note that the univariate projections V^{^)\xj are very similar. The 
proximity of their greedy policies can be explained based on this observation. 
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Figure 18: Comparison of three HALP solvers on the 3x3 grid irrigation network problem 
(Figure 13). The solvers are compared by the objective value of a relaxed HALP 
(OV), the expected discounted reward of a corresponding policy, and computa- 
tion time (in seconds). The £-HALP, MCMC-HALP, and MC-HALP solvers are 
parameterized by the resolution of e-grid (e), the number of MCMC chains {N), 
and the number of samples {N). Note that the quality of policies improves with 
higher grid resolution (1/e) and larger sample size {N). An upper bound on the 
expected returns is 87.2. 



by a subset of 400 constraints on average (Kveton & Hauskrecht, 2004). Similarly, the oracle 
Cmcmc in the MCMC-HALP formulation (A^ = 250) iterates through 250 x 500 x (10+10) = 
2,500,000 state-action configurations (Figure 11). However, corresponding LP formulations 
involve only 700 constraints on average. 

7.3 The Curse of Treewidth 

In the ring and ring-of-rings topologies, the treewidth of the constraint space (in continuous 
variables) is 2 and 3, respectively. As a result, the oracle can perform variable elimination 
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for a small e, and the e-HALP solver returns close-to-optimal policies. Unfortunately, small 
treewidth is atypical in real-world domains. For instance, the treewidth of a more complex 
3x3 grid irrigation network (Figure 13c) is 6. To perform variable elimination for e = 1/16, 
the separation oracle requires the space of [1/e -|- 1]^«2^'^, which is at the memory limit 
of existing PCs. To analyze the behavior of our separation oracles (Section 6) in this setting, 
we repeat our experiments from Section 7.2 on the 3x3 grid irrigation network. 

Based on the results in Figure 18, we conclude that the time complexity of the e-HALP 
solver grows by the rate of [1/e + 1]^. Therefore, approximate constraint space satisfaction 
(MC-HALP and MCMC-HALP) generates better results than a combinatorial optimization 
on an insufficiently discretizcd e-grid (e-HALP). This conclusion is parallel to those in large 
structured optimization problems with continuous variables. We believe that a combination 
of exact and approximate steps delivers the best tradeoff between the quality and complexity 
of our solutions (Section 6.4). 

8. Conclusions 

Development of scalable algorithms for solving real- world decision problems is a challenging 
task. In this paper, we presented a theoretically sound framework that allows for a compact 
representation and efficient solutions to hybrid factored MDPs. We believe that our results 
can be applied to a variety of optimization problems in robotics, manufacturing, or financial 
mathematics. This work can be extended in several interesting directions. 

First, note that the concept of closed-form solutions to the expectations terms in HALP 
is not limited to the choices in Section 5.2. For instance, we can show that if P{x) and f{x) 
are normal densities, Ep(3,)[/(a;)] has a closed-form solution (Kveton & Hauskrecht, 2006b). 
Therefore, we can directly reason with normal transition functions instead of approximating 
them by a mixture of beta distributions. Similar conchisions are true for piecewise constant, 
piecewise linear, and gamma transition and basis functions. Note that our efficient solutions 
apply to any approach to solving hybrid factored MDPs that approximates the optimal value 
function by a linear combination of basis functions (Equation 5). 

Second, the constraint space in HALP (16) V"" — T*^" > exhibits the same structure 
as the constraint space in approximate policy iteration (API) (Guestrin et al., 2001; Patrascu 
et al., 2002) WV"" — T*V'"\\^ < e, where e is a variable subject to minimization. As a result, 
our work provides a recipe for solving API formulations in hybrid state and action domains. 
In discrete-state spaces, Patrascu et al. (2002) and Guestrin (2003) showed that API returns 
better policies than ALP for the same set of basis functions. Note that API is more complex 
than ALP because every step of API involves satisfying the constraint \\V^ — T*V^\\^ < e 
for some fixed e. 

Third, automatic learning of basis functions seems critical for the application of HALP 
to real-world domains. Patrascu et al. (2002) analyzed this problem in discrete-state spaces 
and proposed a greedy approach to learning basis functions. Kveton and Hauskrecht (2006a) 
generalized these ideas and showed how to learn parametric basis functions in hybrid spaces. 
We believe that a combination of the greedy search with a state space analysis (Mahadevan, 
2005; Mahadevan &; Maggioni, 2006) can yield even better basis functions. 

Finally, we proposed several bounds (Section 5.1 and 6.2.1) that may explain the quality 
of the complete and relaxed HALP formulations. In future, we plan to empirically evaluate 
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their tightness on a variety of low-dimensional hybrid optimization problems (Bresina et al., 
2002; Munos &; Moore, 2002) with known optimal value functions. 
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Appendix A. Proofs 

Proof of Proposition 1: The Bellman operator T* is known to be a contraction mapping. 
Based on its monotonicity, for any value function V,V >T*V implies V > T*V > ■ ■ ■ > V* . 
Since constraints in the HALF formulation (16) enforce > T*V^, we conclude > V*. 
□ 

Proof of Proposition 2: Based on Proposition 1, we note that the constraint > T*V'^ 
guarantees that > V*. Subsequently, our claim is proved by realizing: 

argminEv,[F'^] = argminE^fT/"^ - V*] 

w w 

and 

= Ey,|y*-y^| 

The proof generalizes from the discrete-state case (de Farias &z Van Roy, 2003) without any 
alternations. □ 

Proof of Theorem 2: Similarly to Theorem 2 (de Farias &; Van Roy, 2003), this claim is 
proved in three steps. First, wc find a point w in the feasible region of the HALF such that 
is within 0{e) distance from , where: 

w* = argmin IIT^* — 

w 



Such a point w is given by: 



V* - v 



. , (l + 7)e 

w = w -\ e, 

1-7 



where e = (1,0, ... ,0) is an indicator of the constant basis function /o(x) = 1. This point 
satisfies all requirements and its feasibility can be handily verified by solving: 



1 — 7 \ 1 — 7 

= V^* -r*V'"'' + {l + j)e 
> 0, 
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where the last step follows from the inequality: 





< 

oo 


yw* _ Y* 


+ 

oo 


Y* 'y~* j 


oo 








y* _ yw* 


+ 

oo 


^j-^Y* 'j~^*y^* 


oo 



< (l + 7)e- 

Subsequently, we bound the max-norm error of by using the triangle inequality: 



I lloo — 



y* - y^ 



+ 



1-7 



2e 



1-7' 

which yields a bound on the weighted >Ci-norm error of 



V* - v 



< 



2e 



■7 



The proof generalizes from the discrete-state case (de Farias & Van Roy, 2003) without any 
alternations. □ 

Proof of Theorem 3: Similarly to Theorem 2, this claim is proved in three steps: finding 
a point w in the feasible region of the HALF, bounding the max-norm error of V^, which 
in turn yields a bound on the £i-norm error of V^. A comprehensive proof for the discrete- 
state case was done by de Farias and Van Roy (2003). This proof generalizes to structured 
state and action spaces with continuous variables. □ 

Proof of Proposition 3: The proposition is proved in a sequence of steps: 



E 



P(x) 



[fix)] 



PhcUx I a,(3)x^{l-xrdx 

r(a + /3) r(a + n)r(/3 + m) f r(a + /3 + n + m) _ .0+rn-i . 

r{a)T{(3) r{a + P + n + m) J^T{a + n)r(/3 + m) ^ ' 

r{a + (3) T{a + n)T{/3 + m) 
r(a)r(/3) r{a + P + n + m) 



/ Pbetc,{x \ a + n,P + m)dx . 

J X 
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Since integration is a distributive operation, our claim straightforwardly generalizes to the 
mixture of beta distributions P{x). □ 

Proof of Proposition 4: The proposition is proved in a sequence of steps: 

^P{x)[f{^)] = Phete.{x\a,P)'^lii.^^.^{x){aiX + bi)dx 
•^^ i 

l-ri 

= XI / Phetsiix \ a, P){aiX + hi) dx 

i 

= E 

i 

= E 

i 

= E 



/ P]:,exs.{x\a,l3)xdx + hi I P\,cta.{x\a,l3)dx 

k 



a 



a + P 
a 



/ -Pbeta(a^ \a + l,P)dx + hi / Pbeta(a; I a, /3) dx 
Jli Jii 

{F+in) - F+{li)) + hiiFin) - F{li)) 



Since integration is a distributive operation, our claim straightforwardly generalizes to the 
mixture of beta distributions P{x). □ 

Proof of Proposition 5: This claim is proved in three steps. First, we construct a point 
w in the feasible region of the HALF such that is within 0{S) distance from . Such 
a point w is given by: 

_ ^ S 
w = w + 



1-7 ' 

where e = (1, 0, . . . , 0) is an indicator of the constant basis function /o(x) = 1. This point 
satisfies all requirements and its feasibility can be handily verified by solving: 



- r*y^ + 



7(5 
1^ 



1-7 
> 0, 

where the inequality —T*V^ > —6 holds from the (5-infeasibility of w. Since the optimal 
solution w is feasible in the relaxed HALF, we conclude [y^] < [l/^] . Subsequently, 
this inequality yields a bound on the weighted £i-norm error of V^: 



\V* - V 



= E^ 
= E^ 

< E^ 



+ 



1-7 
5 



V* 



V* - v 



1-7 
S 

6 

i.V- 1-7 
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Finally, we combine this result with the triangle inequality: 



V* - v 



< Wv* - v 



1,^ 



\l,1p 



< 



V* - v 



+ 



2S 



I,-!/' 1-7 

which leads to a bound on the weighted £i-norm error of . □ 
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